import numpy as np
import pandas as pd
Question: Please refer to the table below to answer the following questions:
data = [['Yes',400,100,500],['No',200,1300,1500],['Total',600,1400,2000]]
df = pd.DataFrame(data,columns = ['Planned to purchase Product A', 'Actually placed and order for Product A - Yes', 'Actually placed and order for Product A - No', 'Total'])
df = df.set_index(['Planned to purchase Product A'])
df
| Planned to purchase Product A | Actually placed and order for Product A - Yes | Actually placed and order for Product A - No | Total |
|---|---|---|---|
| Yes | 400 | 100 | 500 |
| No | 200 | 1300 | 1500 |
| Total | 600 | 1400 | 2000 |
Ans1_1 = df['Actually placed and order for Product A - Yes']['Yes']/df['Total']['Total']
Ans1_2 = df['Actually placed and order for Product A - Yes']['Yes']/df['Total']['Yes']
print('The joint probability of people who planned to purchase and actually placed an order is', Ans1_1)
print('\nThe conditional probability that people placed an order, given that they planned to purchase, is', Ans1_2)
The joint probability of people who planned to purchase and actually placed an order is 0.2
The conditional probability that people placed an order, given that they planned to purchase, is 0.8
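As a sanity check on Ans1_2, the conditional probability can be recomputed from the joint and marginal probabilities using the raw counts in the table above (a minimal sketch):

```python
# Sanity check: a conditional probability is the joint probability divided by
# the marginal probability of the conditioning event: P(A|B) = P(A and B) / P(B).
joint = 400 / 2000              # P(planned and ordered) = 0.2
marginal_planned = 500 / 2000   # P(planned) = 0.25
conditional = joint / marginal_planned
print(joint, conditional)       # 0.2 0.8, matching Ans1_1 and Ans1_2
```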
Question: An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.
A. Probability that none of the items are defective?
B. Probability that exactly one of the items is defective?
C. Probability that two or fewer of the items are defective?
D. Probability that three or more of the items are defective?
import scipy.stats as stats
import matplotlib.pyplot as plt
# The four questions above can be answered with the Binomial distribution, since each
# item has two outcomes (defective / not defective) with known probabilities
failure_rate = 0.05
n = 10
k = np.arange(0,11)
probabilities = stats.binom.pmf(k,n,failure_rate) #probability mass function
print(probabilities)
[5.98736939e-01 3.15124705e-01 7.46347985e-02 1.04750594e-02 9.64808106e-04 6.09352488e-05 2.67259863e-06 8.03789063e-08 1.58642578e-09 1.85546875e-11 9.76562500e-14]
print('Probability that none of the items are defective is ',round(probabilities[0],3))
print('Probability that exactly one of the items is defective is ',round(probabilities[1],3))
print('Probability that two or fewer of the items are defective is ',round((probabilities[0]+probabilities[1]+probabilities[2]),3))
print('Probability that three or more of the items are defective is ',round((1- (probabilities[0]+probabilities[1]+probabilities[2])),3))
Probability that none of the items are defective is 0.599
Probability that exactly one of the items is defective is 0.315
Probability that two or fewer of the items are defective is 0.988
Probability that three or more of the items are defective is 0.012
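Parts C and D can also be cross-checked with scipy's cumulative functions instead of summing pmf terms (a short sketch, restating n and the failure rate from above):

```python
import scipy.stats as stats

n, failure_rate = 10, 0.05
# P(X <= 2) via the CDF and P(X >= 3) via the survival function sf(2) = P(X > 2)
p_two_or_fewer = stats.binom.cdf(2, n, failure_rate)
p_three_or_more = stats.binom.sf(2, n, failure_rate)
print(round(p_two_or_fewer, 3), round(p_three_or_more, 3))  # 0.988 0.012
```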
Question: A car salesman sells on an average 3 cars per week.
A. Probability that in a given week he will sell some cars.
B. Probability that in a given week he will sell 2 or more but less than 5 cars.
C. Plot the Poisson distribution function for the cumulative probability of cars sold per week vs the number of cars sold per week.
avg_cars_sold = 3
some_cars = np.arange(0,11)
prob = stats.poisson.pmf(some_cars,avg_cars_sold).round(3)
# Answer A
print('Probability salesman will sell no cars is', prob[0])
print('Probability salesman will sell 1 car is', prob[1])
print('Probability salesman will sell some cars = probability of selling 2 cars or more = ',round(1-prob[0]-prob[1],3))
Probability salesman will sell no cars is 0.05
Probability salesman will sell 1 car is 0.149
Probability salesman will sell some cars = probability of selling 2 cars or more = 0.801
# Answer B
prob4 = stats.poisson.cdf(4,avg_cars_sold)
prob1 = stats.poisson.cdf(1,avg_cars_sold)
print('Probability salesman will sell 2 or more cars but less than 5 cars is', round(prob4-prob1,2))
Probability salesman will sell 2 or more cars but less than 5 cars is 0.62
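The same interval probability can be cross-checked by summing the pmf terms for 2, 3 and 4 cars directly (a minimal sketch, equivalent to the cdf difference above):

```python
import numpy as np
import scipy.stats as stats

avg_cars_sold = 3
# P(2 <= X < 5) as an explicit sum of pmf terms, equivalent to cdf(4) - cdf(1)
p_2_to_4 = stats.poisson.pmf(np.arange(2, 5), avg_cars_sold).sum()
print(round(p_2_to_4, 2))  # 0.62
```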
#Answer C
cdf = stats.poisson.cdf(np.arange(0,11),avg_cars_sold)
plt.plot(np.arange(0,11),cdf,'o-')
plt.title(r'Poisson: $\lambda$ = %i' % avg_cars_sold)  # raw string so \l is not treated as an escape
plt.xlabel("No. of cars sold per week ")
plt.ylabel('Cumulative Probability of Number of cars sold');
Question: Accuracy in understanding orders for a speech-based bot at a restaurant is important for Company X, which designed, marketed and launched the product for contactless delivery during the COVID-19 pandemic. Recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.
A. What is the probability that all three orders will be recognised correctly?
B. What is the probability that none of the three orders will be recognised correctly?
C. What is the probability that at least two of the three orders will be recognised correctly?
prob_reco = stats.binom.pmf(np.arange(0,4),3,0.868)
print('Probability that all three orders will be recognized correctly is ', round(prob_reco[3],3))
print('Probability that none of the three orders will be recognized correctly is ', round(prob_reco[0],3))
print('Probability that at least two of the three orders will be recognized correctly is ', round(prob_reco[2]+prob_reco[3],3))
Probability that all three orders will be recognized correctly is 0.654
Probability that none of the three orders will be recognized correctly is 0.002
Probability that at least two of the three orders will be recognized correctly is 0.952
Question: A group of 300 professionals sat for a competitive exam. The marks obtained have a mean of 60 and a standard deviation of 12, and the pattern of marks follows a normal distribution. Answer the following questions.
A. What is the percentage of students who score more than 80?
B. What is the percentage of students who score less than 50?
C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
perc_more_than_80 = 1 - stats.norm.cdf(80,60,12)
perc_less_than_50 = stats.norm.cdf(50,60,12)
print('Percentage of students who score more than 80 is ', round(perc_more_than_80,3))
print('Percentage of students who score less than 50 is ', round(perc_less_than_50,3))
Percentage of students who score more than 80 is 0.048
Percentage of students who score less than 50 is 0.202
cut_off = 0
# Scan candidate marks; the first mark whose CDF reaches 0.9 is just above the cut-off
for i in range(73,90):
    if stats.norm.cdf(i,60,12) >= 0.9:
        cut_off = i-1
        break
print('Cut-off for awarding distinction to highest 10% students is', cut_off)
print(f'Students getting above {cut_off}, will be awarded distinction')
Cut-off for awarding distinction to highest 10% students is 75
Students getting above 75, will be awarded distinction
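Instead of scanning candidate marks, the 90th percentile can be read off directly with the inverse CDF (percent-point function); a quick sketch consistent with the loop's result:

```python
import scipy.stats as stats

# ppf is the inverse of cdf: the mark below which 90% of students fall
distinction_mark = stats.norm.ppf(0.9, loc=60, scale=12)
print(round(distinction_mark, 2))  # 75.38, consistent with the integer cut-off of 75
```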
Question: Explain one real-life industry scenario [other than the ones mentioned above] where the concepts learnt in this module of Applied Statistics can be used to arrive at a data-driven business solution.
One scenario where the Poisson distribution can be used is staffing decisions in a large retail outlet.
Staffing levels at different times of day would depend on the lowest number of customers/sales that can be expected with a pre-decided probability cut-off (calculated from the observed average number of customers/sales at those times of day), and on whether the revenue from that number of sales covers the cost of wages for the employees.
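A minimal sketch of this idea (the hourly arrival rates, the 90% probability cut-off, and the min_customers helper below are all hypothetical, purely for illustration):

```python
import scipy.stats as stats

# Hypothetical mean customer arrivals per hour for three shifts
hourly_rates = {'morning': 12, 'afternoon': 30, 'evening': 45}

def min_customers(rate, cutoff=0.9):
    """Largest n such that P(X >= n) >= cutoff for X ~ Poisson(rate):
    the outlet can count on at least n customers with probability `cutoff`."""
    n = 0
    while stats.poisson.sf(n, rate) >= cutoff:  # sf(n) = P(X > n) = P(X >= n+1)
        n += 1
    return n

for shift, rate in hourly_rates.items():
    lower_bound = min_customers(rate)
    # A staffing decision would compare the expected revenue from this many
    # customers against the wage cost of staffing the shift.
    print(f'{shift}: at least {lower_bound} customers with 90% probability')
```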
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go
import plotly.offline as pyo
from plotly.offline import init_notebook_mode
df = pd.read_csv('DS - Part2 - Basketball.csv', index_col = 'Team')
# Converting TeamLaunch into four digit year
df.TeamLaunch = df.TeamLaunch.apply(lambda year: year[0:4])
# Imputing zeros in place of '-'
def impute_zero(x):
    if x == '-':
        return 0
    else:
        return int(x)
df['TournamentChampion'] = df['TournamentChampion'].apply(lambda wins: impute_zero(wins))
df['Runner-up'] = df['Runner-up'].apply(lambda seconds: impute_zero(seconds))
# Removing the last row since it does not have data for most columns
df1 = df[0:60]
# Considering current year as 2021
df1['YearsSinceLaunch'] = df1['TeamLaunch'].apply(lambda y: 2021 - int(y))
C:\Users\Kunal\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
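Assigning new columns to a bare slice like df[0:60] is what triggers pandas' SettingWithCopyWarning; the usual fix is an explicit .copy(), sketched here on a toy frame:

```python
import pandas as pd

# df1 as a bare slice is a view of df, so pandas cannot guarantee the
# assignment reaches the original frame. An explicit copy makes df1 an
# independent DataFrame and silences the warning.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df1 = df[0:3].copy()             # .copy() instead of a bare slice
df1['c'] = df1['a'] + df1['b']   # no SettingWithCopyWarning
print(df1)
```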
# Renaming the Columns
df1.columns = ['Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'RunnerUp', 'TeamLaunch', 'HighestPositionHeld', 'YearsSinceLaunch']
# Converting str columns to integer
int_cols = ['Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames',
            'BasketScored', 'BasketGiven', 'HighestPositionHeld']
df1[int_cols] = df1[int_cols].astype(int)
# Creating a column to capture number of times teams finish in the top two in a tournament
df1['toptwo'] = df1.TournamentChampion + df1.RunnerUp
# Creating columns for average score per tournament, average score per game, and wins per game played
def avg(a, b):
    return a / b
df1['AvgScorePerTour'] = df1[['Score','Tournament']].apply(lambda row: avg(row['Score'], row['Tournament']), axis=1)
df1['AvgScorePerGame'] = df1[['Score','PlayedGames']].apply(lambda row: avg(row['Score'], row['PlayedGames']), axis=1)
df1['AvgWinPerGame'] = df1[['WonGames','PlayedGames']].apply(lambda row: avg(row['WonGames'], row['PlayedGames']), axis=1)
# Creating a column for the difference between baskets scored and baskets given
def diff(a, b):
    return a - b
df1['DifferenceInBasket'] = df1[['BasketScored','BasketGiven']].apply(lambda row: diff(row['BasketScored'], row['BasketGiven']), axis=1)
# Final DataFrame for Analysis
df1
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | YearsSinceLaunch | toptwo | AvgScorePerTour | AvgScorePerGame | AvgWinPerGame | DifferenceInBasket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 92 | 56 | 50.988372 | 1.587618 | 0.596307 | 2807 |
| Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 92 | 50 | 49.558140 | 1.543085 | 0.572411 | 2786 |
| Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | 92 | 18 | 43.025000 | 1.316756 | 0.474751 | 1225 |
| Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | 90 | 12 | 41.292683 | 1.271021 | 0.445571 | 929 |
| Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | 92 | 15 | 39.162791 | 1.219406 | 0.437726 | 931 |
| Team 6 | 73 | 2819 | 2408 | 990 | 531 | 887 | 3680 | 3373 | 1 | 4 | 1934 | 1 | 87 | 5 | 38.616438 | 1.170681 | 0.411130 | 307 |
| Team 7 | 82 | 2792 | 2626 | 948 | 608 | 1070 | 3609 | 3889 | 0 | 0 | 1929 | 3 | 92 | 0 | 34.048780 | 1.063214 | 0.361005 | -280 |
| Team 8 | 70 | 2573 | 2302 | 864 | 577 | 861 | 3228 | 3230 | 2 | 3 | 1929 | 1 | 92 | 5 | 36.757143 | 1.117724 | 0.375326 | -2 |
| Team 9 | 58 | 2109 | 1986 | 698 | 522 | 766 | 2683 | 2847 | 0 | 1 | 1939 | 2 | 82 | 1 | 36.362069 | 1.061934 | 0.351460 | -164 |
| Team 10 | 51 | 1884 | 1728 | 606 | 440 | 682 | 2159 | 2492 | 1 | 0 | 1932 | 1 | 89 | 1 | 36.941176 | 1.090278 | 0.350694 | -333 |
| Team 11 | 45 | 1814 | 1530 | 563 | 392 | 575 | 2052 | 2188 | 1 | 5 | 1941 | 1 | 80 | 6 | 40.311111 | 1.185621 | 0.367974 | -136 |
| Team 12 | 51 | 1789 | 1698 | 586 | 389 | 723 | 2278 | 2624 | 0 | 0 | 1939 | 4 | 82 | 0 | 35.078431 | 1.053592 | 0.345112 | -346 |
| Team 13 | 42 | 1471 | 1466 | 463 | 384 | 619 | 1767 | 2180 | 0 | 0 | 1948 | 4 | 73 | 0 | 35.023810 | 1.003411 | 0.315825 | -413 |
| Team 14 | 44 | 1416 | 1428 | 453 | 336 | 639 | 1843 | 2368 | 0 | 1 | 1929 | 2 | 92 | 1 | 32.181818 | 0.991597 | 0.317227 | -525 |
| Team 15 | 43 | 1389 | 1458 | 471 | 358 | 629 | 1753 | 2152 | 0 | 1 | 1944 | 2 | 77 | 1 | 32.302326 | 0.952675 | 0.323045 | -399 |
| Team 16 | 37 | 1351 | 1318 | 426 | 327 | 565 | 1500 | 1834 | 0 | 0 | 1935 | 4 | 86 | 0 | 36.513514 | 1.025038 | 0.323217 | -334 |
| Team 17 | 36 | 1314 | 1255 | 390 | 330 | 535 | 1421 | 1763 | 0 | 0 | 1949 | 4 | 72 | 0 | 36.500000 | 1.047012 | 0.310757 | -342 |
| Team 18 | 38 | 1174 | 1192 | 408 | 292 | 492 | 1642 | 1951 | 0 | 0 | 1933 | 3 | 88 | 0 | 30.894737 | 0.984899 | 0.342282 | -309 |
| Team 19 | 27 | 1148 | 988 | 333 | 256 | 399 | 1182 | 1371 | 0 | 0 | 1960 | 3 | 61 | 0 | 42.518519 | 1.161943 | 0.337045 | -189 |
| Team 20 | 33 | 1020 | 1096 | 367 | 242 | 487 | 1347 | 1746 | 0 | 1 | 1951 | 2 | 70 | 1 | 30.909091 | 0.930657 | 0.334854 | -399 |
| Team 21 | 17 | 970 | 646 | 266 | 172 | 208 | 892 | 789 | 0 | 1 | 1998 | 2 | 23 | 1 | 57.058824 | 1.501548 | 0.411765 | 103 |
| Team 22 | 23 | 667 | 742 | 218 | 175 | 349 | 819 | 1157 | 0 | 0 | 1941 | 6 | 80 | 0 | 29.000000 | 0.898922 | 0.293801 | -338 |
| Team 23 | 17 | 662 | 652 | 189 | 148 | 305 | 760 | 1088 | 0 | 0 | 1977 | 8 | 44 | 0 | 38.941176 | 1.015337 | 0.289877 | -328 |
| Team 24 | 21 | 606 | 678 | 203 | 180 | 295 | 750 | 1022 | 0 | 0 | 1959 | 5 | 62 | 0 | 28.857143 | 0.893805 | 0.299410 | -272 |
| Team 25 | 12 | 553 | 456 | 147 | 112 | 197 | 520 | 633 | 0 | 0 | 2004 | 6 | 17 | 0 | 46.083333 | 1.212719 | 0.322368 | -113 |
| Team 26 | 20 | 538 | 628 | 184 | 149 | 295 | 716 | 1050 | 0 | 0 | 1935 | 5 | 86 | 0 | 26.900000 | 0.856688 | 0.292994 | -334 |
| Team 27 | 13 | 510 | 494 | 155 | 128 | 211 | 619 | 744 | 0 | 0 | 1961 | 5 | 60 | 0 | 39.230769 | 1.032389 | 0.313765 | -125 |
| Team 28 | 18 | 445 | 586 | 145 | 143 | 298 | 607 | 992 | 0 | 0 | 1940 | 11 | 81 | 0 | 24.722222 | 0.759386 | 0.247440 | -385 |
| Team 29 | 12 | 421 | 380 | 125 | 81 | 174 | 458 | 623 | 0 | 0 | 1930 | 6 | 91 | 0 | 35.083333 | 1.107895 | 0.328947 | -165 |
| Team 30 | 11 | 416 | 402 | 113 | 95 | 194 | 430 | 632 | 0 | 0 | 1963 | 6 | 58 | 0 | 37.818182 | 1.034826 | 0.281095 | -202 |
| Team 31 | 12 | 375 | 423 | 123 | 102 | 198 | 422 | 581 | 0 | 0 | 1974 | 7 | 47 | 0 | 31.250000 | 0.886525 | 0.290780 | -159 |
| Team 32 | 14 | 353 | 426 | 129 | 95 | 202 | 492 | 720 | 0 | 0 | 1943 | 4 | 78 | 0 | 25.214286 | 0.828638 | 0.302817 | -228 |
| Team 33 | 12 | 343 | 448 | 104 | 127 | 217 | 393 | 662 | 0 | 0 | 1977 | 12 | 44 | 0 | 28.583333 | 0.765625 | 0.232143 | -269 |
| Team 34 | 9 | 293 | 346 | 96 | 92 | 158 | 291 | 489 | 0 | 0 | 1987 | 7 | 34 | 0 | 32.555556 | 0.846821 | 0.277457 | -198 |
| Team 35 | 11 | 285 | 334 | 103 | 79 | 152 | 419 | 588 | 0 | 0 | 1941 | 4 | 80 | 0 | 25.909091 | 0.853293 | 0.308383 | -169 |
| Team 36 | 7 | 277 | 270 | 76 | 76 | 118 | 320 | 410 | 0 | 0 | 1991 | 7 | 30 | 0 | 39.571429 | 1.025926 | 0.281481 | -90 |
| Team 37 | 6 | 242 | 228 | 62 | 56 | 110 | 244 | 366 | 0 | 0 | 2007 | 8 | 14 | 0 | 40.333333 | 1.061404 | 0.271930 | -122 |
| Team 38 | 9 | 230 | 282 | 82 | 63 | 137 | 285 | 430 | 0 | 0 | 1962 | 5 | 59 | 0 | 25.555556 | 0.815603 | 0.290780 | -145 |
| Team 39 | 4 | 190 | 160 | 52 | 45 | 63 | 199 | 241 | 0 | 0 | 1994 | 10 | 27 | 0 | 47.500000 | 1.187500 | 0.325000 | -42 |
| Team 40 | 5 | 188 | 186 | 50 | 46 | 90 | 202 | 296 | 0 | 0 | 1978 | 8 | 43 | 0 | 37.600000 | 1.010753 | 0.268817 | -94 |
| Team 41 | 6 | 168 | 204 | 59 | 50 | 95 | 216 | 310 | 0 | 0 | 1971 | 12 | 50 | 0 | 28.000000 | 0.823529 | 0.289216 | -94 |
| Team 42 | 6 | 150 | 180 | 53 | 44 | 83 | 165 | 221 | 0 | 0 | 1963 | 7 | 58 | 0 | 25.000000 | 0.833333 | 0.294444 | -56 |
| Team 43 | 4 | 148 | 152 | 37 | 37 | 78 | 155 | 253 | 0 | 0 | 1999 | 17 | 22 | 0 | 37.000000 | 0.973684 | 0.243421 | -98 |
| Team 44 | 3 | 132 | 114 | 35 | 27 | 52 | 139 | 167 | 0 | 0 | 2014 | 10 | 7 | 0 | 44.000000 | 1.157895 | 0.307018 | -28 |
| Team 45 | 7 | 107 | 130 | 43 | 21 | 66 | 227 | 308 | 0 | 0 | 1929 | 3 | 92 | 0 | 15.285714 | 0.823077 | 0.330769 | -81 |
| Team 46 | 3 | 96 | 114 | 26 | 44 | 44 | 101 | 139 | 0 | 0 | 1990 | 9 | 31 | 0 | 32.000000 | 0.842105 | 0.228070 | -38 |
| Team 47 | 4 | 91 | 116 | 34 | 16 | 66 | 181 | 295 | 0 | 0 | 1947 | 7 | 74 | 0 | 22.750000 | 0.784483 | 0.293103 | -114 |
| Team 48 | 2 | 83 | 80 | 20 | 23 | 37 | 62 | 117 | 0 | 0 | 1996 | 17 | 25 | 0 | 41.500000 | 1.037500 | 0.250000 | -55 |
| Team 49 | 2 | 81 | 80 | 19 | 24 | 37 | 70 | 115 | 0 | 0 | 1995 | 19 | 26 | 0 | 40.500000 | 1.012500 | 0.237500 | -45 |
| Team 50 | 4 | 76 | 108 | 30 | 16 | 62 | 145 | 252 | 0 | 0 | 1945 | 10 | 76 | 0 | 19.000000 | 0.703704 | 0.277778 | -107 |
| Team 51 | 3 | 71 | 90 | 29 | 13 | 48 | 121 | 183 | 0 | 0 | 1953 | 14 | 68 | 0 | 23.666667 | 0.788889 | 0.322222 | -62 |
| Team 52 | 4 | 56 | 72 | 21 | 14 | 37 | 153 | 184 | 0 | 0 | 1929 | 6 | 92 | 0 | 14.000000 | 0.777778 | 0.291667 | -31 |
| Team 53 | 2 | 52 | 68 | 17 | 18 | 33 | 71 | 116 | 0 | 0 | 1979 | 10 | 42 | 0 | 26.000000 | 0.764706 | 0.250000 | -45 |
| Team 54 | 3 | 42 | 54 | 18 | 6 | 30 | 97 | 131 | 0 | 0 | 1929 | 8 | 92 | 0 | 14.000000 | 0.777778 | 0.333333 | -34 |
| Team 55 | 2 | 40 | 68 | 13 | 14 | 41 | 70 | 182 | 0 | 0 | 1950 | 16 | 71 | 0 | 20.000000 | 0.588235 | 0.191176 | -112 |
| Team 56 | 1 | 35 | 38 | 8 | 11 | 19 | 36 | 55 | 0 | 0 | 2016 | 17 | 5 | 0 | 35.000000 | 0.921053 | 0.210526 | -19 |
| Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 | 12 | 0 | 34.000000 | 0.894737 | 0.210526 | -28 |
| Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 | 65 | 0 | 22.000000 | 0.733333 | 0.233333 | -20 |
| Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 | 70 | 0 | 19.000000 | 0.633333 | 0.233333 | -34 |
| Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 | 66 | 0 | 14.000000 | 0.466667 | 0.166667 | -31 |
# Teams that have been around the longest
df1[df1.YearsSinceLaunch == df1.YearsSinceLaunch.max()].index
Index(['Team 1', 'Team 2', 'Team 3', 'Team 5', 'Team 7', 'Team 8', 'Team 14',
'Team 45', 'Team 52', 'Team 54'],
dtype='object', name='Team')
# The 10 newest teams, from newest to oldest
df1.sort_values(['YearsSinceLaunch'])[0:10].index
Index(['Team 56', 'Team 44', 'Team 57', 'Team 37', 'Team 25', 'Team 43',
'Team 21', 'Team 48', 'Team 49', 'Team 39'],
dtype='object', name='Team')
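The same ranking can be obtained in one step with nsmallest (a sketch on a toy frame standing in for df1, since the full CSV is not reproduced here):

```python
import pandas as pd

# Toy stand-in for df1; nsmallest sorts and slices in one step,
# equivalent to sort_values('YearsSinceLaunch')[0:10] above.
toy = pd.DataFrame({'YearsSinceLaunch': [92, 5, 23, 7, 61]},
                   index=['Team 1', 'Team 56', 'Team 21', 'Team 44', 'Team 19'])
print(toy.nsmallest(3, 'YearsSinceLaunch').index.tolist())
# ['Team 56', 'Team 44', 'Team 21']
```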
print('Team with the most tournament wins is', df1.TournamentChampion.idxmax())
print('Team with the most top-two finishes is', df1.toptwo.idxmax())
print('Team with the most games won is', df1.WonGames.idxmax())
print('Team with the most baskets scored is', df1.BasketScored.idxmax())
print('Team with the maximum difference between baskets scored and baskets given is', df1.DifferenceInBasket.idxmax())
print('Team with the maximum percentage of games won is', df1.AvgWinPerGame.idxmax())
print('Team with the maximum average score per game is', df1.AvgScorePerGame.idxmax())
print('Team with the maximum average score per tournament is', df1.AvgScorePerTour.idxmax())
Team with the most tournament wins is Team 1
Team with the most top-two finishes is Team 1
Team with the most games won is Team 1
Team with the most baskets scored is Team 1
Team with the maximum difference between baskets scored and baskets given is Team 1
Team with the maximum percentage of games won is Team 1
Team with the maximum average score per game is Team 1
Team with the maximum average score per tournament is Team 21
print('Team with the fewest games won is', df1.WonGames.idxmin())
print('Team with the fewest baskets scored is', df1.BasketScored.idxmin())
print('Team with the worst performance on baskets scored vs baskets given is', df1.DifferenceInBasket.idxmin())
print('Team with the minimum percentage of games won is', df1.AvgWinPerGame.idxmin())
print('Team with the minimum average score per game is', df1.AvgScorePerGame.idxmin())
print('Team with the minimum average score per tournament is', df1.AvgScorePerTour.idxmin())
Team with the fewest games won is Team 60
Team with the fewest baskets scored is Team 60
Team with the worst performance on baskets scored vs baskets given is Team 14
Team with the minimum percentage of games won is Team 60
Team with the minimum average score per game is Team 60
Team with the minimum average score per tournament is Team 52
layout = go.Layout(title="Number of Tournaments Played by Teams", xaxis=dict(title="Teams"),
yaxis=dict(title="Number of Tournaments Played"))
fig = go.Figure(data=go.Bar(y=df1['Tournament'], x = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Total Scores by team", xaxis=dict(title="Teams"),
yaxis=dict(title="Total Scores"))
fig = go.Figure(data=go.Bar(y=df1['Score'], x = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Number of Games Played by Teams", xaxis=dict(title="Teams"),
yaxis=dict(title="Number of Games Played"))
fig = go.Figure(data=go.Bar(y=df1['PlayedGames'], x = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Number of Games Won by Teams", xaxis=dict(title="Teams"),
yaxis=dict(title="Number of Games Won"))
fig = go.Figure(data=go.Bar(y=df1['WonGames'], x = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Number of Baskets Scored by Teams", xaxis=dict(title="Teams"),
yaxis=dict(title="Number of Baskets Scored"))
fig = go.Figure(data=go.Bar(y=df1['BasketScored'], x = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Number of Top Two Finishes by Teams", xaxis=dict(title="Teams"),
yaxis=dict(title="Number of Top Two Finishes"))
fig = go.Figure(data=go.Bar(y=df1['toptwo'], x = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Time Since Launch by Teams", xaxis=dict(title="Teams"),
yaxis=dict(title="Time Since Launch in years"))
fig = go.Figure(data=go.Bar(y=df1['YearsSinceLaunch'], x = df1.index), layout = layout)
fig.show()
# Analysis using original variables
df2 = df1[['Tournament','Score','PlayedGames','WonGames','BasketScored','YearsSinceLaunch']]
sns.pairplot(df2)
<seaborn.axisgrid.PairGrid at 0x1c468216940>
df1.columns
Index(['Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'RunnerUp', 'TeamLaunch', 'HighestPositionHeld', 'YearsSinceLaunch',
'toptwo', 'AvgScorePerTour', 'AvgScorePerGame', 'AvgWinPerGame',
'DifferenceInBasket'],
dtype='object')
# Analysis using new variables
df3 = df1[['YearsSinceLaunch', 'AvgScorePerTour', 'AvgScorePerGame', 'AvgWinPerGame', 'DifferenceInBasket']]
sns.pairplot(df3)
<seaborn.axisgrid.PairGrid at 0x1c46986b9e8>
layout = go.Layout(title="Time Since Launch vs Average Score Per Tournament", xaxis=dict(title="Average Score Per Tournament"),
yaxis=dict(title="Time Since Launch in years"))
fig = go.Figure(data=go.Scatter(x=df1['AvgScorePerTour'], y=df1['YearsSinceLaunch'], mode='markers',hovertext = df1.index), layout = layout)
fig.show()
df1[(df1['AvgScorePerTour']>40) & (df1['Tournament']>=4) & (df1['YearsSinceLaunch']<30)]
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | YearsSinceLaunch | toptwo | AvgScorePerTour | AvgScorePerGame | AvgWinPerGame | DifferenceInBasket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Team 21 | 17 | 970 | 646 | 266 | 172 | 208 | 892 | 789 | 0 | 1 | 1998 | 2 | 23 | 1 | 57.058824 | 1.501548 | 0.411765 | 103 |
| Team 25 | 12 | 553 | 456 | 147 | 112 | 197 | 520 | 633 | 0 | 0 | 2004 | 6 | 17 | 0 | 46.083333 | 1.212719 | 0.322368 | -113 |
| Team 37 | 6 | 242 | 228 | 62 | 56 | 110 | 244 | 366 | 0 | 0 | 2007 | 8 | 14 | 0 | 40.333333 | 1.061404 | 0.271930 | -122 |
| Team 39 | 4 | 190 | 160 | 52 | 45 | 63 | 199 | 241 | 0 | 0 | 1994 | 10 | 27 | 0 | 47.500000 | 1.187500 | 0.325000 | -42 |
layout = go.Layout(title="Time Since Launch vs Win Percentage", xaxis=dict(title="Win Percentage"),
yaxis=dict(title="Time Since Launch in years"))
fig = go.Figure(data=go.Scatter(x=df1['AvgWinPerGame']*100, y=df1['YearsSinceLaunch'], mode='markers',hovertext = df1.index), layout = layout)
fig.show()
From the above plot, Team 21, Team 39, Team 25 and Team 44 look like good contenders. However, it is important to check for a minimum number of games played.
df1[(df1['AvgWinPerGame']>0.3) & (df1['PlayedGames']>100) & (df1['YearsSinceLaunch']<30)]
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | YearsSinceLaunch | toptwo | AvgScorePerTour | AvgScorePerGame | AvgWinPerGame | DifferenceInBasket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Team 21 | 17 | 970 | 646 | 266 | 172 | 208 | 892 | 789 | 0 | 1 | 1998 | 2 | 23 | 1 | 57.058824 | 1.501548 | 0.411765 | 103 |
| Team 25 | 12 | 553 | 456 | 147 | 112 | 197 | 520 | 633 | 0 | 0 | 2004 | 6 | 17 | 0 | 46.083333 | 1.212719 | 0.322368 | -113 |
| Team 39 | 4 | 190 | 160 | 52 | 45 | 63 | 199 | 241 | 0 | 0 | 1994 | 10 | 27 | 0 | 47.500000 | 1.187500 | 0.325000 | -42 |
| Team 44 | 3 | 132 | 114 | 35 | 27 | 52 | 139 | 167 | 0 | 0 | 2014 | 10 | 7 | 0 | 44.000000 | 1.157895 | 0.307018 | -28 |
df1['AvgLossPerGame'] = df1[['LostGames','PlayedGames']].apply(lambda row: avg(row['LostGames'], row['PlayedGames']), axis=1)
layout = go.Layout(title="Time Since Launch vs Loss Percentage", xaxis=dict(title="Loss Percentage"),
yaxis=dict(title="Time Since Launch in years"))
fig = go.Figure(data=go.Scatter(x=df1['AvgLossPerGame']*100, y=df1['YearsSinceLaunch'], mode='markers',hovertext = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Time Since Launch vs Win Percentage upon Loss Percentage", xaxis=dict(title="Win/Loss"),
yaxis=dict(title="Time Since Launch in years"))
fig = go.Figure(data=go.Scatter(x=df1['AvgWinPerGame']/df1['AvgLossPerGame'], y=df1['YearsSinceLaunch'], mode='markers',hovertext = df1.index), layout = layout)
fig.show()
layout = go.Layout(title="Positive Basket Difference", xaxis=dict(title="Basket Difference"),
yaxis=dict(title="Time Since Launch in years"))
fig = go.Figure(data=go.Scatter(x=df1['DifferenceInBasket'], y=df1['YearsSinceLaunch'], mode='markers',hovertext = df1.index), layout = layout)
fig.show()
Data was incomplete for Team 61. The Association should try to obtain the complete data.
The data was not formatted consistently. For example, dashes appeared in place of 0 or NA, and the year format in the TeamLaunch column was not uniform. The recommendation is to capture data in a correct and consistent format.
The column names were inconsistent. For example, the column RunnerUp was initially named Runner-up.
The data appears incorrect for some of the older teams: although they were launched much earlier, the number of games they played is much lower. This needs to be checked.
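The cleaning steps implied by these observations can be sketched in pandas; the column names and sample values below are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical sample mimicking the issues above: dashes in place of 0/NA,
# and a mix of two-digit and four-digit launch years
raw = pd.DataFrame({'Wins': ['10', '-', '7'],
                    'TeamLaunch': ['1998', "'04", '2014']})

clean = raw.replace('-', 0)          # dash placeholders become 0
clean['Wins'] = clean['Wins'].astype(int)

def to_year(y):
    # Normalise "'04"-style entries to full years (crude century guess)
    n = int(str(y).strip().lstrip("'"))
    if n > 100:
        return n
    return 1900 + n if n > 30 else 2000 + n

clean['TeamLaunch'] = clean['TeamLaunch'].apply(to_year)
```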
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('DS - Part3 - CompanyX_EU.csv')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 662 entries, 0 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 662 non-null object 1 Product 656 non-null object 2 Funding 448 non-null object 3 Event 662 non-null object 4 Result 662 non-null object 5 OperatingState 662 non-null object dtypes: object(6) memory usage: 31.2+ KB
All six columns are of the object datatype.
type(df.Startup[0]),type(df.Product[0]),type(df.Funding[1]),type(df.Event[0]),type(df.Result[0]),type(df.OperatingState[0])
(str, str, str, str, str, str)
The data in all columns is stored as strings.
df[df.Product.isna()] #Rows where column Product has null values
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 257 | IMO | NaN | NaN | TC50 2009 | Contestant | Operating |
| 286 | keenkong | NaN | NaN | Disrupt NYC 2010 | Contestant | Operating |
| 318 | LocalHero Inc. | NaN | NaN | Disrupt SF 2011 | Contestant | Operating |
| 359 | MotherKnows | NaN | NaN | Disrupt NYC 2011 | Contestant | Operating |
| 445 | Rainbow | NaN | $3.3M | Disrupt SF 2013 | Contestant | Operating |
| 565 | Tello Applications | NaN | $4.7M | Disrupt SF 2010 | Contestant | Closed |
df[df.Funding.isna()] #Rows where column Funding has null values
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
| 7 | 8020 Media | 8020media.com | NaN | TC40 2007 | Contestant | Operating |
| 8 | About Last Night | aboutlastnight.net | NaN | Disrupt NYC 2012 | Contestant | Operating |
| ... | ... | ... | ... | ... | ... | ... |
| 648 | YOOBIC | yoobic.com | NaN | Disrupt London 2015 | Finalist | Operating |
| 651 | YourVersion | yourversion.com | NaN | TC50 2009 | Audience choice | Operating |
| 652 | Yuwan | iyuwan.cn | NaN | Disrupt Beijing 2011 | Contestant | Closed |
| 653 | ZAP! | zapreklam.com/ | NaN | Disrupt EU 2014 | Audience choice | Operating |
| 655 | Zello | zello.com | NaN | TC40 2007 | Contestant | Operating |
214 rows × 6 columns
df.dropna(inplace = True)
df.reset_index(drop = True, inplace = True) # Resetting the index; drop=True discards the old index instead of keeping it as a column
# Getting the unique suffix letters at the end of the Funding strings
c = []
for i in df.Funding:
c.append(i[-1])
np.unique(np.array(c))
array(['B', 'K', 'M'], dtype='<U1')
# Creating a function to convert a funding string such as '$3.3M' to a numerical value
def conv_to_num(a):
    if a[-1] == 'K':
        m = 1000
    elif a[-1]=='M':
        m = 1000000
    else:                   # only 'B' remains among the observed suffixes
        m = 1000000000
    b = float(a[1:-1])*m    # strip the leading '$' and the trailing suffix letter
    return b
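A quick standalone sanity check of this conversion logic (the function is restated here so the snippet runs on its own):

```python
# Same logic as conv_to_num above: map the suffix letter to a multiplier,
# strip the leading '$' and the suffix letter, then scale
def conv_to_num(a):
    m = {'K': 1000, 'M': 1000000}.get(a[-1], 1000000000)
    return float(a[1:-1]) * m

assert conv_to_num('$630K') == 630000.0
assert conv_to_num('$2M') == 2000000.0
assert conv_to_num('$3B') == 3000000000.0
```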
df['Funding'] = df['Funding'].apply(conv_to_num)
# Getting the funds in the millions range into one array
# (this can also be done by subsetting the dataframe)
d = []
for i in df.Funding:
if (i>999999) and (i<999999999):
d.append(i)
d = np.array(d)
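As the comment notes, the same array can be produced with a boolean mask instead of a loop; a standalone sketch on made-up values:

```python
import numpy as np

funding = np.array([500000.0, 2500000.0, 35000000.0, 1200000000.0])
# keep values strictly between $1M and $1B, matching the loop's condition
d = funding[(funding > 999999) & (funding < 999999999)]
```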
import plotly
import plotly.graph_objects as go
import plotly.offline as pyo
from plotly.offline import init_notebook_mode
layout = go.Layout(title="Distribution of Funds in Millions", xaxis=dict(title="Funding"),
yaxis=dict(title="Funds in Millions"))
fig = go.Figure(data=go.Box(y=d), layout = layout )
fig.show()
Minimum value is $ 1 Million
Upper Fence is $ 35.5 Million
c = 0
for i in d:
    if i>35500000:
        c = c+1
print('Number of outliers greater than Upper Fence is', c)
Number of outliers greater than Upper Fence is 33
e = []
for i in d:
if i <=35500000:
e.append(i)
e = np.array(e)
len(e)
276
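The outlier count and the trimmed array above can both be expressed with boolean masks as well (standalone sketch on made-up values):

```python
import numpy as np

d = np.array([2.0e6, 4.0e7, 3.0e7, 5.0e7, 1.0e6])
fence = 35500000                      # the upper fence from the box plot
c = int((d > fence).sum())            # number of outliers above the fence
e = d[d <= fence]                     # funds at or below the fence
```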
layout = go.Layout(title="Distribution of Funds in Millions till $35.5 Million", xaxis=dict(title="Funding"),
yaxis=dict(title="Funds in Millions"))
fig = go.Figure(data=go.Box(y=e), layout = layout )
fig.show()
df.OperatingState.value_counts()
Operating 319 Acquired 66 Closed 57 Ipo 4 Name: OperatingState, dtype: int64
layout = go.Layout(title="Distribution plot for Funds in Millions", xaxis=dict(title="Funds in Millions"),
yaxis=dict(title="Count"))
bins = plotly.graph_objects.histogram.XBins(size=10000000)
fig = go.Figure(data=go.Histogram(x = d, xbins = bins ), layout = layout )
fig.show()
layout = go.Layout(title="Distribution plot for Funding of Companies still Operating", xaxis=dict(title="Funds in Millions"),
yaxis=dict(title="Count"))
#bins = plotly.graph_objects.histogram.XBins(size=10000000)
fig = go.Figure(data=go.Histogram(x = df[df.OperatingState =='Operating'].Funding), layout = layout )
fig.show()
layout = go.Layout(title="Distribution plot for Funding of Companies that closed", xaxis=dict(title="Funds in Millions"),
yaxis=dict(title="Count"))
bins = plotly.graph_objects.histogram.XBins(size=1000000)
fig = go.Figure(data=go.Histogram(x = df[df.OperatingState =='Closed'].Funding, xbins = bins), layout = layout )
fig.show()
from scipy import stats
Null Hypothesis (Ho): There is no significant difference between the funding raised by companies that are still operating and those that are closed.
Alternate Hypothesis (Ha): There is a significant difference between the funding raised by companies that are still operating and those that are closed.
# Checking for significance with 95% confidence
# We would reject the null hypothesis if p < 0.05
F,p = stats.f_oneway(df[df['OperatingState'] =='Operating']['Funding'], df[df['OperatingState'] =='Closed']['Funding'])
print(F,p)
1.2957097053103792 0.2557270188562704
Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
Therefore we conclude that there is no significant difference between the funding raised by companies that are still operating and those that are closed.
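As a cross-check: with only two groups, one-way ANOVA is equivalent to a pooled two-sample t-test (F = t²), so stats.ttest_ind would give the same p-value here. A sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10, 2, 50)    # synthetic funding values, group 1
b = rng.normal(11, 2, 60)    # synthetic funding values, group 2

F, p_anova = stats.f_oneway(a, b)
t, p_ttest = stats.ttest_ind(a, b)   # pooled-variance t-test (default)

# F equals t squared, and the two p-values coincide
```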
df2 = pd.read_csv('DS - Part3 - CompanyX_EU.csv')
df2['Result'].value_counts()
Contestant 488 Finalist 84 Audience choice 41 Winner 26 Runner up 23 Name: Result, dtype: int64
num_win_still_op = len(df2[(df2.Result == 'Winner') & (df2.OperatingState =='Operating')])
perc_win_still_op = 100*num_win_still_op/df2['Result'].value_counts()['Winner']   # label-based lookup instead of positional
num_cont_still_op = len(df2[(df2.Result == 'Contestant') & (df2.OperatingState =='Operating')])
perc_cont_still_op = 100*num_cont_still_op/df2['Result'].value_counts()['Contestant']
print('Percentage of contestants still operating is',round(perc_cont_still_op,2))
print('\nPercentage of winners still operating is',round(perc_win_still_op,2))
Percentage of contestants still operating is 68.03 Percentage of winners still operating is 73.08
Null Hypothesis (Ho): There is no significant difference between the proportion of winners that are still operating and the proportion of contestants that are still operating
Alternate Hypothesis (Ha): There is a significant difference between the proportion of winners that are still operating and the proportion of contestants that are still operating
from statsmodels.stats.proportion import proportions_ztest
significance = 0.05  # two-sided test at 95% confidence
# our samples - 19 out of 26 winners and 332 out of 488 contestants
successes = np.array([num_win_still_op, num_cont_still_op])
samples = np.array([df2['Result'].value_counts()['Winner'], df2['Result'].value_counts()['Contestant']])
Z, p = proportions_ztest(count=successes, nobs=samples, alternative='two-sided')
print(round(Z,3),round(p,3))
0.539 0.59
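The statistic can be reproduced by hand from the pooled proportion, using the counts noted above (19 of 26 winners and 332 of 488 contestants still operating):

```python
import math

x1, n1 = 19, 26     # winners still operating
x2, n2 = 332, 488   # contestants still operating

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
# two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
```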
Since the p-value is greater than the significance level, we fail to reject the null hypothesis: there is no significant difference between the proportion of winners that are still operating and the proportion of contestants that are still operating.
Being a winner, therefore, gave the winning companies no significant advantage in terms of continuing operations between the time of the competition and the time the data was recorded.
df2['Event'].value_counts()
TC50 2008 52 TC50 2009 50 TC40 2007 40 Disrupt NYC 2011 32 Disrupt NYC 2013 31 Disrupt SF 2013 31 Disrupt SF 2011 31 Disrupt NYC 2012 30 Disrupt SF 2012 30 Disrupt SF 2014 28 Disrupt SF 2015 27 Disrupt NYC 2014 27 Disrupt SF 2016 26 Disrupt NY 2015 26 Disrupt SF 2010 26 Disrupt NY 2016 22 Disrupt NYC 2010 22 Disrupt Beijing 2011 18 Hardware Battlefield 2015 15 Disrupt London 2015 15 Disrupt EU 2014 15 Disrupt EU 2013 15 Hardware Battlefield 2014 14 Hardware Battlefield 2016 13 - 13 Disrupt London 2016 13 Name: Event, dtype: int64
# Flag Disrupt events held in 2013 or later (the last two characters of the event name are the year)
def select(a):
    if (a[0:7]=='Disrupt'):
        if (int(a[-2:])>=13):
            return 'Yes'
        else:
            return 'No'
    else:
        return 'No'
df2['Selected_Event'] = df2['Event'].apply(select)
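The same flag can also be derived without a row-wise function, using pandas string methods; pd.to_numeric with errors='coerce' keeps malformed event names (such as the lone '-') from raising. A standalone sketch:

```python
import pandas as pd

events = pd.Series(['Disrupt SF 2013', 'Disrupt NYC 2011', 'TC50 2009', '-'])
year = pd.to_numeric(events.str[-4:], errors='coerce')   # NaN for malformed names
selected = ((events.str.startswith('Disrupt')) & (year >= 2013)).map({True: 'Yes', False: 'No'})
```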
df_disrupt = df2[df2['Selected_Event']=='Yes']
df_disrupt
| Startup | Product | Funding | Event | Result | OperatingState | Selected_Event | |
|---|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating | Yes |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | Yes |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating | Yes |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | Yes |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 646 | YayPay Inc | yaypay.com | $900K | Disrupt London 2015 | Contestant | Operating | Yes |
| 648 | YOOBIC | yoobic.com | NaN | Disrupt London 2015 | Finalist | Operating | Yes |
| 653 | ZAP! | zapreklam.com/ | NaN | Disrupt EU 2014 | Audience choice | Operating | Yes |
| 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | Yes |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | Yes |
276 rows × 7 columns
df_disrupt.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 276 entries, 0 to 660 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 276 non-null object 1 Product 275 non-null object 2 Funding 194 non-null object 3 Event 276 non-null object 4 Result 276 non-null object 5 OperatingState 276 non-null object 6 Selected_Event 276 non-null object dtypes: object(7) memory usage: 17.2+ KB
df_disrupt = df_disrupt.dropna()
df_disrupt
| Startup | Product | Funding | Event | Result | OperatingState | Selected_Event | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | Yes |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | Yes |
| 13 | Agrilyst | agrilyst.com | $1M | Disrupt SF 2015 | Winner | Operating | Yes |
| 14 | Aiden | aiden.ai | $750K | Disrupt London 2016 | Contestant | Operating | Yes |
| 16 | Aircall | aircall.io | $11.6M | Disrupt SF 2015 | Contestant | Operating | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 641 | Workspot | workspot.com | $15.8M | Disrupt NYC 2013 | Contestant | Operating | Yes |
| 642 | Xendo | xendo.com | $28K | Disrupt SF 2014 | Contestant | Acquired | Yes |
| 646 | YayPay Inc | yaypay.com | $900K | Disrupt London 2015 | Contestant | Operating | Yes |
| 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | Yes |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | Yes |
193 rows × 7 columns
df_disrupt = df_disrupt.reset_index()
df_disrupt
| index | Startup | Product | Funding | Event | Result | OperatingState | Selected_Event | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | Yes |
| 1 | 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | Yes |
| 2 | 13 | Agrilyst | agrilyst.com | $1M | Disrupt SF 2015 | Winner | Operating | Yes |
| 3 | 14 | Aiden | aiden.ai | $750K | Disrupt London 2016 | Contestant | Operating | Yes |
| 4 | 16 | Aircall | aircall.io | $11.6M | Disrupt SF 2015 | Contestant | Operating | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | 641 | Workspot | workspot.com | $15.8M | Disrupt NYC 2013 | Contestant | Operating | Yes |
| 189 | 642 | Xendo | xendo.com | $28K | Disrupt SF 2014 | Contestant | Acquired | Yes |
| 190 | 646 | YayPay Inc | yaypay.com | $900K | Disrupt London 2015 | Contestant | Operating | Yes |
| 191 | 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | Yes |
| 192 | 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | Yes |
193 rows × 8 columns
# Extracting the event location (the text between 'Disrupt ' and the trailing year)
df_disrupt['Location'] = df_disrupt['Event'].apply(lambda y: y[8:-5])
df_disrupt
| index | Startup | Product | Funding | Event | Result | OperatingState | Selected_Event | Location | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | Yes | NYC |
| 1 | 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | Yes | NY |
| 2 | 13 | Agrilyst | agrilyst.com | $1M | Disrupt SF 2015 | Winner | Operating | Yes | SF |
| 3 | 14 | Aiden | aiden.ai | $750K | Disrupt London 2016 | Contestant | Operating | Yes | London |
| 4 | 16 | Aircall | aircall.io | $11.6M | Disrupt SF 2015 | Contestant | Operating | Yes | SF |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | 641 | Workspot | workspot.com | $15.8M | Disrupt NYC 2013 | Contestant | Operating | Yes | NYC |
| 189 | 642 | Xendo | xendo.com | $28K | Disrupt SF 2014 | Contestant | Acquired | Yes | SF |
| 190 | 646 | YayPay Inc | yaypay.com | $900K | Disrupt London 2015 | Contestant | Operating | Yes | London |
| 191 | 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | Yes | NYC |
| 192 | 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | Yes | SF |
193 rows × 9 columns
df_disrupt = df_disrupt.drop(columns=['index'])
df_disrupt
| Startup | Product | Funding | Event | Result | OperatingState | Selected_Event | Location | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | Yes | NYC |
| 1 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | Yes | NY |
| 2 | Agrilyst | agrilyst.com | $1M | Disrupt SF 2015 | Winner | Operating | Yes | SF |
| 3 | Aiden | aiden.ai | $750K | Disrupt London 2016 | Contestant | Operating | Yes | London |
| 4 | Aircall | aircall.io | $11.6M | Disrupt SF 2015 | Contestant | Operating | Yes | SF |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | Workspot | workspot.com | $15.8M | Disrupt NYC 2013 | Contestant | Operating | Yes | NYC |
| 189 | Xendo | xendo.com | $28K | Disrupt SF 2014 | Contestant | Acquired | Yes | SF |
| 190 | YayPay Inc | yaypay.com | $900K | Disrupt London 2015 | Contestant | Operating | Yes | London |
| 191 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | Yes | NYC |
| 192 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | Yes | SF |
193 rows × 8 columns
np.unique(df_disrupt['Location'])
array(['EU', 'London', 'NY', 'NYC', 'SF'], dtype=object)
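The fixed-offset slice y[8:-5] works because 'Disrupt ' is eight characters and the trailing ' YYYY' is five, but splitting on whitespace and keeping the middle tokens is less brittle (standalone sketch):

```python
import pandas as pd

events = pd.Series(['Disrupt NYC 2013', 'Disrupt NY 2016',
                    'Disrupt London 2016', 'Disrupt SF 2015'])
# the tokens between 'Disrupt' and the year form the location
location = events.str.split().apply(lambda toks: ' '.join(toks[1:-1]))
```

Note that 'NY' and 'NYC' in the array above very likely refer to the same venue; merging them before any by-location comparison would be prudent.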
df_disrupt['Funding'] = df_disrupt['Funding'].apply(lambda a: conv_to_num(a))
df_disrupt
| Startup | Product | Funding | Event | Result | OperatingState | Selected_Event | Location | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3DLT | 3dlt.com | 630000.0 | Disrupt NYC 2013 | Contestant | Closed | Yes | NYC |
| 1 | 3Dprintler | 3dprintler.com | 1000000.0 | Disrupt NY 2016 | Audience choice | Operating | Yes | NY |
| 2 | Agrilyst | agrilyst.com | 1000000.0 | Disrupt SF 2015 | Winner | Operating | Yes | SF |
| 3 | Aiden | aiden.ai | 750000.0 | Disrupt London 2016 | Contestant | Operating | Yes | London |
| 4 | Aircall | aircall.io | 11600000.0 | Disrupt SF 2015 | Contestant | Operating | Yes | SF |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | Workspot | workspot.com | 15800000.0 | Disrupt NYC 2013 | Contestant | Operating | Yes | NYC |
| 189 | Xendo | xendo.com | 28000.0 | Disrupt SF 2014 | Contestant | Acquired | Yes | SF |
| 190 | YayPay Inc | yaypay.com | 900000.0 | Disrupt London 2015 | Contestant | Operating | Yes | London |
| 191 | Zenefits | zenefits.com | 583600000.0 | Disrupt NYC 2013 | Finalist | Operating | Yes | NYC |
| 192 | Zula | zulaapp.com | 3400000.0 | Disrupt SF 2013 | Audience choice | Operating | Yes | SF |
193 rows × 8 columns
Null Hypothesis (Ho): There is no significant difference between the funding raised by companies across NY, SF and EU events.
Alternate Hypothesis (Ha): There is a significant difference between the funding raised by companies across NY, SF and EU events.
# Checking for significance with 95% confidence
# We would reject the null hypothesis if p < 0.05
F,p = stats.f_oneway(df_disrupt[df_disrupt['Location'] =='NY']['Funding'],df_disrupt[df_disrupt['Location'] =='SF']['Funding'], df_disrupt[df_disrupt['Location'] =='EU']['Funding'])
print(round(F,3),round(p,3))
0.907 0.406
Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
Therefore we conclude that there is no significant difference between the funding raised by companies in events across NY, SF and EU. (Note that only events labelled 'NY' are included here; the 'NYC' events, which may refer to the same venue, are excluded from this comparison.)
data_three_cities = df_disrupt[(df_disrupt['Location']=='NY') | (df_disrupt['Location']=='SF') | (df_disrupt['Location']=='EU')]
sns.boxplot(data = data_three_cities, x = 'Location', y = 'Funding');
layout = go.Layout(title="Comparison of Funding between NY, EU and SF ", xaxis=dict(title="Locations"),
yaxis=dict(title="Funds in Millions"))
fig = go.Figure(data=go.Box(y=data_three_cities['Funding'],x=data_three_cities['Location']), layout = layout )
fig.show()